Appendix G — Assignment G

Data aggregation

Instructions

  1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.

  2. Write your code in the Code cells and your answer in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

  3. Use Quarto to print the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  4. The assignment is worth 100 points, and is due on 4th December 2023 at 11:59 pm. No extension is possible on this assignment due to tight grading deadlines.

  5. You are not allowed to use a for loop in this assignment.

  6. There is a bonus question worth 5 points.

  7. There are hints provided in a couple of questions. For each of those questions, if you can propose a solution that is shorter than the one proposed in the hint, you will get 10 ultra-bonus points. Email your solution to krish@northwestern.edu to claim the ultra-bonus points. A solution may be shorter in terms of number of computations. If it is shorter with respect to any other criterion, then the number of ultra-bonus points awarded will be a subjective decision.

  8. The maximum possible score in the assigment is 100 + 5 (bonus question) + 20 (two ultra-bonus questions) = 125 out of 100. There is no partial credit for bonus / ultra-bonus questions.

  9. If you are updating a dataset (imputing missing values / creating new variables), then use the updated dataset in a subsequent question.

  10. Five points are properly formatting the assignment. The breakdown is as follows:

  • Must be an HTML file rendered using Quarto (2 pts).
  • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
  • Final answers of each question are written in Markdown cells (1 pt).
  • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)

G.1 Spotify

Read the spotify dataset from the file spotify_data.csv.

What percentage of all the unique tracks are contributed by the top 3 artists of each genre, where the top artists are based on artist_popularity, and the unique tracks are based on unique values of track_name? If there are multiple values of artist_popularity for an artist_name-genres combination, then consider the maximum value of artist_popularity.

The following approach will not work: If you group the data by genres, and filter the top 3 rows by artist_popularity, then you may not get 3 unique artists, as one artist can have multiple tracks.

(10 points)

Hint:

  1. Group the data by genres, and artist_name. Find the number of unique tracks (by track_name) and maximum artist_popularity for each group.

  2. The dataset obtained in (1) is at artist-genre level, i.e., each row corresponds to a unique artist_name-genres combination. Group that dataset by genres, and filter the top 3 rows of each group based on artist_popularity.

  3. Sum up the number of unique tracks of the dataset obtained in (2) and divide it by the total number of unique tracks in the original dataset.

Note: The functions nunique(), and agg() may be useful.

G.2 Canadian Fish Biodiversity

Read data from the file Canadian_Fish_Biodiversity.csv on Canvas. Each row records a unique fishing event from a 2013 sample of fish populations in Ontario, Canada. To analyze the results of these fishing surveys, we need to understand the dynamics of projects, sites, and geographic locations.

G.2.1 Top 3 projects

Each site (identified by the column SITEID) represents a time and place at which fishing events occurred. Sites are grouped into broader projects (identified by the column Project Name). We want to understand the scope of these projects.

Using groupby(), find the top three projects by number of unique sites.

(4 points)

G.2.2 Missing value imputation with groupby()

G.2.2.1 Number of missing values

How many values are missing for the air temperature column (Air Temperature (C))?

(1 point)

G.2.2.2 Missing value imputation: attempt 1

Using groupby(), impute the missing values of air temperature with the median air temperature of the corresponding water body (Waterbody Name) and Month.

(4 points)

G.2.2.3 Missing values remaining after attempt 1

How many missing values still remain for the air temperature column after the imputation in the previous question?

(1 point)

G.2.2.4 Missing value imputation: attempt 2

We will try to impute the remaining missing values for air temperature. Try to impute the remaining missing values of air temperature with the median air temperature of the corresponding project (Project Name) and Month.

(4 points)

G.2.2.5 Missing values remaining after attempt 2

How many missing values still remain for the air temperature column after the imputation in the previous question?

(1 point)

G.2.2.6 Air-water temperatures correlation

Find the correlation between air temperature and water temperature.

(1 point)

G.2.2.7 Missing values remaning after hypothetical attempt 3

As you found a high correlation between air temperature and water temperature, you can use water temperature to estimate the air temperature (using the trendline, like you did in assignment 5). Assuming you already did that, how many missing values will still remain for the air temperature column?

Note: Do not impute the missing values using the trendline, just assume you already did that.

(3 points)

G.2.2.8 Visualizing missing value imputation

Make a scatterplot of air temperature against water temperature. Highlight the points for which the air temperature was imputed in attempts 1 and 2 with a different color.

(8 points)

G.2.3 Living conditions

This section begins to investigate the living conditions of fish at different locations and time periods. Continue using the updated dataset with the imputed missing values in attempts 1 and 2 of the previous section.

G.2.3.1 Air-water temperatures: Summary statistics

Use a single groupby statement to view the minimum, mean, standard deviation, and maximum air temperature and water temperature for each project during the month of August (use the Month column).

(5 points)

G.2.3.2 Air-water temperatures: visualizing yearly trend

Make lineplots showing maximum air temperature and water temperature by Month and Region. To construct Region, use the Pandas function cut() to satisfy the following conditions:

  • Rows with a latitude lower than 42.4 should have Southern in the Region column
  • Rows with a latitude between 42.4 and 42.8 should have Central in the Region column
  • Rows with a latitude higher than 42.8 should have Northern in the Region column

You can have the month on the horizontal axis, the temperature on the vertical axis, different colors for different regions, and different styles (solid line / dotted line) to indicate air/water temperature.

Does anything in the visualization surprise you? Why or why not?

(14 points)

G.2.4 Fish diversity

Finally let’s focus on the stars of this survey—the fish, of course.

G.2.4.1 Top 3 species by Region

Let’s continue using our Region categorization. Find the top three fish Species in each region by Number Captured.

(10 points)

G.2.4.2 Species spread across Region

Are certain fish only found in some regions? Visualize how many species are in all three regions, how many are in two of three, and how many were only captured in one region.

(10 points)

G.2.4.3 Exclusive fishes by region

What percentage of all species are exclusively captured in the Southern region? How about the Northern Region? And the Central region?

(10 points)

Hint:

  1. Find the number of distinct regions in which each species is found.
  2. Filter the species that are found only in one region.
  3. Group the data, containing only the species found in (2), by region, count the number of unique species in each group, and divide by the total number of distinct species.

G.2.4.4 Turbidity

Turbidity (Turbidity (ntu)) quantifies the level of cloudiness in liquid. For fish in each of the three regions, is there a linear association between turbidity and number of fish caught? You may consider a correlation higher than 50% in magnitude as presence of a linear association.

(5 points)

G.2.4.5 Fish dimensions

Now let’s turn to the length of fish captured, given by Maximum (mm) and Minimum (mm). Find the overall maximum and minimum lengths of all fish in each region. Which region has the largest range in captured fish length?

(4 points)

G.2.4.6 Bonus question: Inverse Simpson index

Find the inverse Simpson index of species counts for each waterbody type (WaterbodyType) within each region. Which combination of waterbody type and region has the greatest diversity of fish species? Which has the least?

The inverse Simpson index \((\frac{1}{\lambda})\) is a measure of ecological diversity, for which a larger index number indicates a greater diversity of species. The index is calculated as:

\(\frac{1}{\lambda} = 1/(\sum^R_{i=1} p_i^2)\)

where \(R\) is the number of unique species and \(p_i\) is the proportion of fish belonging to species \(i\).

(5 points - no partial credit)